[Spec][MOE][Internal Op] Specification of MOE internal operation #32255
Conversation
# Experts computation part (GEMM3_SWIGLU)
x_proj = matmul(reshaped_hidden_states, weight_0, transpose_a=False, transpose_b=True)
x_proj2 = matmul(reshaped_hidden_states, weight_1, transpose_a=False, transpose_b=True)
swiglu = swish(x_proj, beta=expert_beta)
x_proj = x_proj2 * swiglu
down_proj = matmul(x_proj, weight_2, transpose_a=False, transpose_b=True)
The GPU plugin request is to transpose those weights at the conversion stage, so both MatMul transpose_a/transpose_b attributes should be False at this point:
Suggested change:

# Experts computation part (GEMM3_SWIGLU)
x_proj = matmul(reshaped_hidden_states, weight_0, transpose_a=False, transpose_b=False)
x_proj2 = matmul(reshaped_hidden_states, weight_1, transpose_a=False, transpose_b=False)
swiglu = swish(x_proj, beta=expert_beta)
x_proj = x_proj2 * swiglu
down_proj = matmul(x_proj, weight_2, transpose_a=False, transpose_b=False)
cc: @yeonbok
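For reference, a minimal NumPy sketch (not part of the spec, and not OpenVINO code) showing that pre-transposing the expert weights at the conversion stage is numerically equivalent to keeping transpose_b=True on the MatMuls; the shapes, weight names, and the beta handling below are illustrative assumptions:

```python
import numpy as np

def swish(x, beta=1.0):
    # SiLU/Swish activation used inside the SwiGLU expert block
    return x / (1.0 + np.exp(-beta * x))

tokens, hidden, intermediate = 4, 8, 16
x = np.random.rand(tokens, hidden).astype(np.float32)
w0 = np.random.rand(intermediate, hidden).astype(np.float32)  # gate projection
w1 = np.random.rand(intermediate, hidden).astype(np.float32)  # up projection
w2 = np.random.rand(hidden, intermediate).astype(np.float32)  # down projection

# Current spec pattern: transpose_b=True, i.e. x @ W^T
out_transposed = (swish(x @ w0.T) * (x @ w1.T)) @ w2.T

# GPU plugin request: transpose the weights once at conversion stage,
# then run plain MatMuls with transpose_a=False, transpose_b=False
w0_t, w1_t, w2_t = w0.T.copy(), w1.T.copy(), w2.T.copy()
out_plain = (swish(x @ w0_t) * (x @ w1_t)) @ w2_t

assert np.allclose(out_transposed, out_plain)
```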
…ts/operation-specs/internal/moe.rst
@@ -0,0 +1,151 @@
.. {#openvino_docs_ops_internal_MOE}

MOE
Let us not use the MoE name, because we may want to use it for an external operation and for a real MoE operation. Right now it is a sort of FusedExperts.
The routing weights and indices are provided as inputs, so the core MOE idea is preserved; the final multiplication and ReduceSum are included.
I would keep the name as is, to make the current purpose clear.
The MOE internal op can be refactored as needed in the future, and possibly extended with a Router.
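As a hedged illustration of the "final multiplication and ReduceSum" mentioned above (tensor names and shapes are assumptions for this sketch, not the spec), the routing weights could be applied like this:

```python
import numpy as np

tokens, hidden, topk = 4, 8, 2
# Outputs of the selected experts for each token: [tokens, topk, hidden]
expert_outputs = np.random.rand(tokens, topk, hidden).astype(np.float32)
# Normalized routing weights coming from the external router: [tokens, topk]
routing_weights = np.random.rand(tokens, topk).astype(np.float32)
routing_weights /= routing_weights.sum(axis=-1, keepdims=True)

# Final multiplication by the routing weights, then ReduceSum over the expert axis
weighted = expert_outputs * routing_weights[..., None]  # [tokens, topk, hidden]
output_hidden_states = weighted.sum(axis=1)             # [tokens, hidden]
```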
.. code-block:: py
   :force:

   # Common part: Reshape hidden states and prepare for expert computation
I propose to add router_topk_output_indices into this logic. It will show how the weights are prepared. Currently it is not clear how router_topk_output_indices is used in the specified operation.
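To make the suggestion concrete, here is a hedged NumPy sketch of one possible way router_topk_output_indices could be used to prepare dense per-expert weights; tensor names and shapes are assumptions for illustration, not the specified behaviour of the internal op:

```python
import numpy as np

tokens, num_experts, topk = 4, 8, 2
# Assumed inputs: top-k expert indices and their routing weights per token
router_topk_output_indices = np.random.randint(0, num_experts, size=(tokens, topk))
router_topk_output_weights = np.random.rand(tokens, topk).astype(np.float32)

# Scatter the top-k weights into a dense [tokens, num_experts] map;
# experts that were not selected keep a zero weight
dense_weights = np.zeros((tokens, num_experts), dtype=np.float32)
np.put_along_axis(dense_weights, router_topk_output_indices,
                  router_topk_output_weights, axis=1)

# dense_weights[t, e] can then scale expert e's output for token t
# before the ReduceSum over the expert axis.
```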
Good job! Thank you, Kasia. I left a couple of comments.
Co-authored-by: Tatiana Savina <[email protected]>
… experts into MOE (#32183)

### Details:
This transformation is for compile time and is not enabled by default; it should be enabled in each plugin with MOE plugin support. Example registration of the fusion transformation for the CPU plugin: 41145cf

- Fuse vectorized MatMul experts into MOE for the 3GEMMs and 2GEMMs patterns:

  ```
  class ov::pass::VectorizedExpertsFusion : public ov::pass::GraphRewrite {
  public:
      OPENVINO_GRAPH_REWRITE_RTTI("VectorizedExpertsFusion");
      VectorizedExpertsFusion() {
          add_matcher<ov::pass::FuseVectorizedMOE2GEMM>();
          add_matcher<ov::pass::FuseVectorizedMOE3GEMM>();
      }
  };
  ```

- Add internal MOE op

MOE internal op spec PR:
- #32255

## Preliminary requirements (offline transformations):
- Patterns match MatMul (transpose_a=False, transpose_b=**True**); for batched MatMuls a preliminary update of MatMulConstTransposesExtraction is needed:
  - #32378
- Fusion of separate MatMul experts into a vectorized (batched) MatMul:
  - #32199

### Tickets:
- transformation (and fusion details): 173663, op: 171913
Details:
- They will not appear in the converted model public IR.
- Describes MOE used in PR:

Tickets: